Optimizers

Prerequisites

The goal is to find the argument \(x\) of a function \(f\) that minimizes the value \(f(x)\).

\(f(x)\) - function of which the local (global) minimum shall be found

\(f'(x)\) - derivative function of \(f(x)\)

\(x_{0}\) - initial value (it can be picked randomly)

The idea is to create a sequence of points \((x_{0}, x_{1}, x_{2}, \ldots)\) that converges to a local (or global) minimum of the function, guided by its derivative. We assume that \(f\) is differentiable (mathematically this is a strong assumption, but in practice it rarely causes problems).

In order to find a maximum of a function \(f\), one can simply find the minimum of the function \(-f\). Hence, we only need to consider the minimization problem.

Effectively, we need to define the difference between the \(n\)-th and the \((n+1)\)-th element of the sequence above, i.e.

\(x_{n+1} := x_{n} + \Delta x_{n}\)

SGD (Stochastic gradient descent)

\(\Delta x_{n}:= - \eta \cdot f'(x_{n})\)

\(\eta\) - learning rate (a traditional default value is 0.1 or 0.01)

\(x_{n+1} = x_{n} + \Delta x_{n}\)
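The update rule above can be sketched in a few lines of Python, applied here to Example 1 below, \(f(x)=0.04 \cdot x^{2}\) with \(f'(x)=0.08 \cdot x\). The values of `eta`, `steps`, and the starting point `x0` are illustrative choices, not prescribed by the text:

```python
# Minimal SGD sketch: iterate x_{n+1} = x_n - eta * f'(x_n).
def sgd(df, x0, eta=0.1, steps=1000):
    x = x0
    for _ in range(steps):
        x -= eta * df(x)   # Delta x_n = -eta * f'(x_n)
    return x

# f(x) = 0.04 * x**2 has derivative f'(x) = 0.08 * x and minimum at x = 0.
x_min = sgd(lambda x: 0.08 * x, x0=-10.0)  # approaches the minimizer x = 0
```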

Init example

Example 1

Formula

\(f(x)=0.04 \cdot x^{2}\)

\(f'(x)=0.08 \cdot x\)

Function plot

Example 2

Formula

\(f(x)= - \frac{1}{\sqrt{2 \cdot \pi}} \cdot e^{-\frac{1}{2}x^{2}}\)

\(f'(x)= \frac{1}{\sqrt{2 \cdot \pi}} \cdot e^{-\frac{1}{2}x^{2}} \cdot x\)

Function plot

Example 3

Formula

\(f(x) = -0.01 \cdot x \text{ for } x < -1\)

\(f(x) = x^{2} \text{ for } -1 \leq x\)

\(f'(x)=-0.01 \text{ for } x < -1\)

\(f'(x) = 2 \cdot x \text{ for } -1 \leq x\)

Function plot

Momentum

\(\Delta x_{n} = \beta \cdot \Delta x_{n-1} - \eta \cdot f'(x_{n})\)

\(x_{n+1} = x_{n} + \Delta x_{n}\)

Stochastic gradient descent with momentum remembers the update \(\Delta x_{n}\) at each iteration and determines the next update as a linear combination of the gradient and the previous update.

The name momentum stems from an analogy to momentum in physics: the weight vector \(w\), thought of as a particle traveling through parameter space,[17] incurs acceleration from the gradient of the loss ("force"). Unlike classical stochastic gradient descent, it tends to keep traveling in the same direction, which dampens oscillations. Momentum has been used successfully in the training of artificial neural networks for several decades.[20] The momentum method is closely related to underdamped Langevin dynamics and may be combined with simulated annealing.
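A minimal sketch of the momentum update in Python, on the test function \(f(x)=x^{2}\) with \(f'(x)=2 \cdot x\); the values of `beta`, `eta`, `steps`, and the starting point are illustrative choices:

```python
# SGD with momentum: dx_n = beta * dx_{n-1} - eta * f'(x_n); x_{n+1} = x_n + dx_n.
def momentum_sgd(df, x0, eta=0.1, beta=0.9, steps=1000):
    x, dx = x0, 0.0
    for _ in range(steps):
        dx = beta * dx - eta * df(x)   # blend previous update with current gradient
        x += dx
    return x

x_min = momentum_sgd(lambda x: 2 * x, 5.0)  # approaches the minimizer x = 0
```

With `beta = 0`, this reduces exactly to plain SGD; larger `beta` lets accumulated velocity carry the iterate through small bumps and flat regions.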

Init example

Example 1

Formula

\(f(x) = -0.01 \cdot x \text{ for } x < -1\)

\(f(x) = x^{2} \text{ for } -1 \leq x\)

\(f'(x)=-0.01 \text{ for } x < -1\)

\(f'(x) = 2 \cdot x \text{ for } -1 \leq x\)

Example 1b

Formula

\(f(x) = -0.01 \cdot x \text{ for } x < -1\)

\(f(x) = x^{2} \text{ for } -1 \leq x\)

\(f'(x)=-0.01 \text{ for } x < -1\)

\(f'(x) = 2 \cdot x \text{ for } -1 \leq x\)

\(\beta = 0.98\)

Example 2

Formula

\(f(x)= - \frac{1}{\sqrt{2 \cdot \pi}} \cdot e^{-\frac{1}{2}x^{2}}\)

\(f'(x)= \frac{1}{\sqrt{2 \cdot \pi}} \cdot e^{-\frac{1}{2}x^{2}} \cdot x\)

Example 3

Formula

\(f(x) = -0.01 \cdot x \text{ for } x < -1\)

\(f(x) = x^{2} \text{ for } -1 \leq x\)

\(f'(x)=-0.01 \text{ for } x < -1\)

\(f'(x) = 2 \cdot x \text{ for } -1 \leq x\)

\(x_{0} = -10\)

AdaGrad (adaptive gradient)

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate, first published in 2011.[23] Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.

Sequence of gradients: \(g_{n} = f'(x_{n})\)

\(G_{n} = \sum_{i = 1}^{n} g_{i}^{2}\)

\(\Delta x_{n} = - \frac{\eta}{\sqrt{G_{n} + \epsilon}} \cdot g_{n}\) (a small \(\epsilon\) is added to avoid division by zero)

\(x_{n+1} = x_{n} + \Delta x_{n}\)
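A minimal AdaGrad sketch in Python on \(f(x)=0.04 \cdot x^{2}\) with \(f'(x)=0.08 \cdot x\). The small `eps` term is a standard numerical-stability addition; `eta`, `steps`, and the starting point are illustrative choices:

```python
import math

# AdaGrad: accumulate G_n = sum of squared gradients and scale each step by 1/sqrt(G_n),
# so frequently-large gradient directions get progressively smaller steps.
def adagrad(df, x0, eta=0.5, eps=1e-8, steps=1000):
    x, G = x0, 0.0
    for _ in range(steps):
        g = df(x)
        G += g * g                           # G_n = G_{n-1} + g_n^2
        x -= eta / math.sqrt(G + eps) * g    # Delta x_n = -eta / sqrt(G_n) * g_n
    return x

x_min = adagrad(lambda x: 0.08 * x, x0=-10.0)  # approaches the minimizer x = 0
```

Because \(G_{n}\) only ever grows, the effective learning rate decays monotonically; this is the drawback AdaDelta addresses below.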

Example 1

\(f'(x)=0.08 \cdot x\)

Results

Example 2

\(f'(x)=0.08 \cdot x\)

Results

AdaDelta (adaptive delta)

Adadelta optimization is a stochastic gradient descent method based on an adaptive learning rate per dimension, addressing two drawbacks:

- The continual decay of learning rates throughout training.
- The need for a manually selected global learning rate.

Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done. Compared to Adagrad, in the original version of Adadelta you don’t have to set an initial learning rate.

Sequence of gradients and their running averages:

\(g_{n} = f'(x_{n})\)

\(G_{n} = \rho_{1} \cdot G_{n-1} + (1- \rho_{1}) \cdot g_{n}^{2}\)

\(\Delta x_{n} = - \frac{\sqrt{E[\Delta x^{2}_{n-1}] + \epsilon}}{\sqrt{G_{n} + \epsilon}} \cdot g_{n}\)

\(E[\Delta x_{n}^{2}] = \rho_{2} \cdot E[\Delta x_{n-1}^{2}] + (1- \rho_{2}) \cdot \Delta x_{n}^{2}\)

\(x_{n+1} = x_{n} + \Delta x_{n}\)
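A minimal AdaDelta sketch in Python on \(f(x)=x^{2}\) with \(f'(x)=2 \cdot x\). The values of `rho1`, `rho2`, `eps`, `steps`, and the starting point are illustrative choices; note that no global learning rate appears:

```python
import math

# AdaDelta: the step size is the ratio of two running averages, so the
# units of the update match the units of x and no learning rate is needed.
def adadelta(df, x0, rho1=0.95, rho2=0.95, eps=1e-6, steps=5000):
    x = x0
    G = 0.0    # running average of squared gradients, G_n
    Ed = 0.0   # running average of squared updates, E[dx^2]
    for _ in range(steps):
        g = df(x)
        G = rho1 * G + (1 - rho1) * g * g
        dx = -math.sqrt(Ed + eps) / math.sqrt(G + eps) * g
        Ed = rho2 * Ed + (1 - rho2) * dx * dx
        x += dx
    return x

x_min = adadelta(lambda x: 2 * x, 1.0)  # hovers near the minimizer x = 0
```

Because the averages use an exponentially decaying window rather than a full sum, old gradients are forgotten and the method keeps making progress late in training.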

Adam (Adaptive Moment Estimation)

The algorithm calculates an exponential moving average of the gradient and of the squared gradient; the parameters \(\beta_{1}\) and \(\beta_{2}\) control the decay rates of these moving averages.

Initializing the moving averages at zero, combined with \(\beta_{1}\) and \(\beta_{2}\) values close to 1.0 (recommended), biases the moment estimates towards zero. This bias is overcome by first calculating the biased estimates and then computing bias-corrected estimates from them.

Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

Sequence of gradients and moment estimates:

\(g_{n} = f'(x_{n})\)

\(m_{n} = \beta_{1} \cdot m_{n-1} + (1 - \beta_{1}) \cdot g_{n}\)

\(v_{n} = \beta_{2} \cdot v_{n-1} + (1 - \beta_{2}) \cdot g^{2}_{n}\)

\(\hat{m}_{n} = \frac{m_{n}}{1 - \beta_{1}^{n}}\)

\(\hat{v}_{n} = \frac{v_{n}}{1 - \beta_{2}^{n}}\)

\(\Delta x_{n} = - \eta \cdot \frac{\hat{m}_{n}}{\sqrt{\hat{v}_{n}+ \epsilon}}\)

\(x_{n+1} = x_{n} + \Delta x_{n}\)
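A minimal Adam sketch in Python, where the bias correction divides by \(1 - \beta^{n}\) at step \(n\). The test function \(f(x)=0.04 \cdot x^{2}\) and the hyperparameter values are illustrative choices (`beta1=0.9`, `beta2=0.999`, `eps=1e-8` are the commonly recommended defaults):

```python
import math

# Adam: keep exponential moving averages of the gradient (m) and the
# squared gradient (v), correct their zero-initialization bias, then step.
def adam(df, x0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    x, m, v = x0, 0.0, 0.0
    for n in range(1, steps + 1):
        g = df(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** n)             # bias-corrected estimates
        v_hat = v / (1 - beta2 ** n)
        x -= eta * m_hat / math.sqrt(v_hat + eps)
    return x

x_min = adam(lambda x: 0.08 * x, -10.0)  # approaches the minimizer x = 0
```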

2-dimensions

All of the updates above carry over to higher dimensions: \(x\) becomes a vector and the derivative \(f'(x)\) is replaced by the gradient \(\nabla f(x)\).
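A two-dimensional SGD sketch on the illustrative function \(f(x, y) = x^{2} + 10 \cdot y^{2}\), whose gradient is \((2 \cdot x,\ 20 \cdot y)\); `eta`, `steps`, and the starting point are illustrative choices:

```python
# 2-D gradient descent: the scalar derivative becomes a gradient vector,
# and each coordinate is stepped against its partial derivative.
def sgd_2d(grad, p0, eta=0.05, steps=1000):
    x, y = p0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy   # step against the gradient
    return x, y

# f(x, y) = x**2 + 10 * y**2 has gradient (2x, 20y) and minimum at (0, 0).
x_min, y_min = sgd_2d(lambda x, y: (2 * x, 20 * y), (3.0, 2.0))
```

The differing curvature along the two axes (2 versus 20) is exactly the situation where the per-dimension learning rates of AdaGrad, AdaDelta, and Adam help.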